In [81]:
!pip install cufflinks
In [11]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
plt.style.use('ggplot')
import seaborn as sns # for making plots with seaborn
color = sns.color_palette()
sns.set(rc={'figure.figsize':(25,15)})
import plotly
plotly.offline.init_notebook_mode(connected=True)
import plotly.graph_objs as go
import plotly.figure_factory as ff
#import cufflinks as cf
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline
In [164]:
df = pd.read_csv('./db/googleplaystore.csv')
According to the description of the dataset and of its columns, we have a dataset with information about the Google Play Store. We could apply many algorithms to it, since it offers a good variety of instances, more than 10k, and 13 features.
Our consulting firm has received a request to discover the kind of mobile app on which our client should concentrate efforts to maximize the return on investment. As a first glimpse, we considered data from the Google Play platform for this study, since it represents the lower break-even point among mobile products.
Therefore, we acquired a dataset with more than ten thousand instances and more than ten features from the Google Play platform to analyse what the next killer app could be.
The dataset is a collection of Google Play Store entries with a good variety of information, such as the number of installations, the genre of the application and the reviews, making 13 features in total for about eleven thousand apps from the mobile platform.
So, let's look at the data we have in the dataset by taking a sample below:
In [165]:
df.sample(10)
Out[165]:
We can check that our dataset has 10,841 tuples
In [166]:
len(df)
Out[166]:
and 13 columns/features
In [167]:
df.columns
Out[167]:
Where each column/feature represents:
- App - the name of the application
- Category - the category of the application
- Rating - the rating given by the users
- Reviews - the number of reviews given by the users
- Size - the size of the application
- Installs - the number of installs
- Type - whether the application is paid or free
- Price - the price charged
- Content Rating - the age rating
- Genres - the genre(s) of the application
- Last Updated - the date when it was last updated
- Current Ver - the current (latest) version number
- Android Ver - the minimum compatible Android version

Now let's see what each dataset feature presents in terms of its data, such as the range of values and the categories it contains.
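A quick way to get that overview (an illustrative sketch; the calls below are just one possible way to inspect ranges and categories):
In [ ]:
# data type of each column
print(df.dtypes)
# number of distinct values per column
print(df.nunique())
# a peek at the distinct categories, for example
print(df['Category'].unique())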
In this section, we adjust the data in our dataset. Why do we need to do that? Because datasets usually come with problems, such as missing values in some tuples or data that makes no sense (like 100 sons, or a last-updated date in 1900 when the data collection happened after 2000), and those problems can make the results go wrong.
So what do we need to do? The first steps follow the assignment description, but we also need to put some extra effort into this dataset, as shown in the cleaning cell below:
In [168]:
# drop duplicate apps, keeping the first occurrence
df.drop_duplicates(subset='App', inplace=True)
# drop rows with a missing 'Android Ver' and shifted rows where 'Installs' holds a Type value
df = df[df['Android Ver'].notna()]
df = df[df['Android Ver'] != 'NaN']
df = df[df['Installs'] != 'Free']
df = df[df['Installs'] != 'Paid']
# make 'Installs' numeric: strip the '+' suffix and the thousands separators, then cast to float
df['Installs'] = df['Installs'].apply(lambda x: x.replace('+', '') if '+' in str(x) else x)
df['Installs'] = df['Installs'].apply(lambda x: x.replace(',', '') if ',' in str(x) else x)
df['Installs'] = df['Installs'].apply(lambda x: float(x))
# make 'Size' numeric (in MB): 'Varies with device' becomes NaN, 'M' is stripped, 'k' values are divided by 1000
df['Size'] = df['Size'].apply(lambda x: str(x).replace('Varies with device', 'NaN') if 'Varies with device' in str(x) else x)
df['Size'] = df['Size'].apply(lambda x: str(x).replace(',', '') if ',' in str(x) else x)
df['Size'] = df['Size'].apply(lambda x: str(x).replace('M', '') if 'M' in str(x) else x)
df['Size'] = df['Size'].apply(lambda x: float(str(x).replace('k', '')) / 1000 if 'k' in str(x) else x)
df['Size'] = df['Size'].apply(lambda x: float(x))
# make 'Reviews' an int and 'Price' a float (stripping the '$' sign)
df['Reviews'] = df['Reviews'].apply(lambda x: int(x))
df['Price'] = df['Price'].apply(lambda x: str(x).replace('$', '') if '$' in str(x) else str(x))
df['Price'] = df['Price'].apply(lambda x: float(x))
Analyzing the amount of missing data, we have:
In [169]:
total = df.isnull().sum().sort_values(ascending=False)
percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(6)
Out[169]:
We need to act on this by dropping all instances with missing values from our dataset:
In [170]:
df.dropna(how ='any', inplace = True)
Checking the result of this operation, we end up with fewer instances:
In [171]:
total = df.isnull().sum().sort_values(ascending=False)
percent = (df.isnull().sum()/df.isnull().count()).sort_values(ascending=False)
missing_data = pd.concat([total, percent], axis=1, keys=['Total', 'Percent'])
missing_data.head(6)
Out[171]:
In [172]:
print(len(df))
In [173]:
10841 - 7021
Out[173]:
In [174]:
df["Price"].describe()
Out[174]:
In [175]:
x = df['Rating'].dropna()
z = df['Installs'][df.Installs!=0].dropna()
p = df['Reviews'][df.Reviews!=0].dropna()
t = df['Type'].dropna()
price = df['Price']
p = sns.pairplot(pd.DataFrame(list(zip(x, np.log(z), np.log10(p), t, price)),
columns=['Rating', 'Installs', 'Reviews', 'Type', 'Price']), hue='Type', palette="Set2")
In [74]:
print("\n", df['Category'].unique())
In [176]:
print(df["Genres"].unique())
The feature 'Price' is one of the columns that needs some normalization. As we can see below, we have a mean of $1.17 per app and the data is very skewed. Below we plot the distribution of this feature.
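As a quick numeric check of that skewness (an illustrative snippet, not part of the original analysis), the sample skewness and upper quantiles can be read directly from pandas:
In [ ]:
# sample skewness of Price; a value far above 0 indicates a heavy right tail
print(df['Price'].skew())
# most apps are free, so the upper quantiles show where the tail begins
print(df['Price'].quantile([0.5, 0.9, 0.99]))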
In [177]:
from scipy import stats
sns.distplot(df["Price"], kde=False, fit=stats.norm);
In [89]:
df["Price"].describe()
Out[89]:
In [178]:
# note: this creates a reference to df, not an independent copy (df.copy() would keep the original untouched)
dfmod = df
# min-max normalization of 'Price' to the [0, 1] range
dfmod["Price"] = (df["Price"] - df["Price"].min()) / (df["Price"].max() - df["Price"].min())
In [179]:
dfmod["Price"].describe()
Out[179]:
So, as we can see below, we now have normalized price data in the dfmod DataFrame, and it follows the same histogram shape. Like Price, there are other features we need to check, such as the number of installs, the rating, the reviews, and the size.
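For reference, the normalization used here and in the next cells is the standard min-max scaling, $x' = \frac{x - \min(x)}{\max(x) - \min(x)}$, which maps the smallest value to 0 and the largest to 1 while preserving the shape of the distribution.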
In [180]:
sns.distplot(dfmod["Price"], kde=False, fit=stats.norm);
Regarding 'Installs', the information comes categorized: the number of downloads is divided into chunks (buckets such as 1,000+ and 10,000+), and we normalize it below.
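Before normalizing, a quick look at those buckets (an illustrative check, not in the original notebook):
In [ ]:
# distinct install buckets and how many apps fall into each one
print(df['Installs'].value_counts().sort_index())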
In [181]:
%%capture
dfmod["Installs"] = (df["Installs"] - df["Installs"].min()) / (df["Installs"].max() - df["Installs"].min())
Let's see how the distribution of the rating values looks. We have a really high mean of 4.16 in the ratings data and a standard deviation of around 0.56. Notice that the minimum rating is 1 and that the first quartile is already around 4; in other words, we have skewed data with a long left tail. We will normalize this feature as well, further below.
In [182]:
print(dfmod["Rating"].describe())
sns.distplot(df["Rating"], kde=False, fit=stats.norm);
In [158]:
sns.distplot(df["Installs"], rug=True, hist=False)
Out[158]:
If we are going to use a classification method, we need a label to classify on. For this purpose we will define one based on the distribution of the rating, proposing the following bins:
In [183]:
print(df["Rating"].describe())
print("\n", df['Rating'].unique())
sns.distplot(df["Rating"], hist=True)
dfmod.loc[(dfmod['Rating'] >= 0.0 ) & (dfmod['Rating'] <= 4.25 ), 'label_rating'] = '0 bad'
dfmod.loc[(dfmod['Rating'] > 4.25 ) & (dfmod['Rating'] <= 4.75 ), 'label_rating'] = '1 normal'
dfmod.loc[(dfmod['Rating'] > 4.75), 'label_rating'] = '2 good'
print(dfmod['label_rating'].unique())
Now that we have categorized the new label, we are ready to normalize the rating data. Remember that we are normalizing the data to the 0-1 range.
In [185]:
dfmod["Rating"] = (df["Rating"] - df["Rating"].min()) / (df["Rating"].max() - df["Rating"].min())
sns.distplot(dfmod["Rating"], hist=True)
Out[185]:
We should also normalize the installs, reviews and price data:
In [192]:
dfmod["Installs"] = (df["Installs"] - df["Installs"].min()) / (df["Installs"].max() - df["Installs"].min())
dfmod["Price"] = (df["Price"] - df["Price"].min()) / (df["Price"].max() - df["Price"].min())
dfmod["Reviews"] = (df["Reviews"] - df["Reviews"].min()) / (df["Reviews"].max() - df["Reviews"].min())
dfmod[['Rating', 'Installs', 'Price', 'Reviews', 'label_rating' ]]
Out[192]:
In [198]:
#sns.dfmod[['Rating', 'Installs', 'Price', 'Reviews', 'label_rating' ]]
#sns.pairplot(data = dfmod[['Rating', 'Installs', 'Price', 'Reviews', 'label_rating' ]])
x = dfmod['Rating']
z = dfmod['Installs']
p = dfmod['Reviews'][df.Reviews!=0].dropna()
t = dfmod['label_rating'].dropna()
price = df['Price']
p = sns.pairplot(pd.DataFrame(list(zip(x, z, p, t, price)),
columns=['Rating', 'Installs', 'Reviews', 'label_rating', 'Price']), hue='label_rating', palette="Set2")
What the pair plot above shows us is how the mobile market works. The first plot that we noticed was the $rating \times installs$ chart: apps with the highest ratings tend to have far fewer installs. The same behaviour can be observed in the $rating \times reviews$ chart. The third observation is about the $rating \times price$ chart, where we can conclude that there is no highly priced app among the 'good' (highest-rated) ones.
Observing the second row, third and fourth columns, we can see that the number of reviews stays very low regardless of whether the app is labelled bad, normal or good, which also indicates a poor correlation between these two features. The $installs \times price$ chart shows that there is no good paid app with a high number of installs, and that some normal and bad apps are priced way above the mean price of the Google Play Store.
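To back the 'poor correlation' observation with a number (an illustrative check on the normalized dfmod frame, not part of the original notebook), we can compute the pairwise Pearson correlations directly:
In [ ]:
# Pearson correlation between the normalized numeric features used in the pair plot
print(dfmod[['Rating', 'Installs', 'Reviews', 'Price']].corr())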
As we are dealing with a descriptive dataset, we are going to explore unsupervised methods in this section, namely the k-means algorithm and Agglomerative Clustering.
k-means is an algorithm aimed at the class of problems called unsupervised problems. This means that, from the data itself, we try to divide the dataset into clusters. So we do not need labels, only the number of clusters k. As in numerous other methods, k-means works by minimizing a cost function, and this kind of algorithm is susceptible to local minima.
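Concretely, the cost function minimized by k-means is the within-cluster sum of squares, $J = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2$, where $C_i$ is the $i$-th cluster and $\mu_i$ its centroid; different initializations can converge to different local minima of $J$.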
As proposed in the assignment description, we should consider a minimum k of two and a maximum of $\lceil \log_2{n} \rceil$, where $n$ is the number of instances in our dataset.
After the pre-processing section we ended up with around 7k instances, which gives approximately $\lceil \log_2{7000} \rceil = \lceil 12.77 \rceil = 13$ as the maximum number of clusters $k$ to use in our experiments.
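A quick sanity check of this bound on the cleaned DataFrame (an illustrative snippet, not part of the original notebook):
In [ ]:
import math
# upper bound for k: ceil(log2(n)) on the cleaned dataset
print(len(df), math.ceil(math.log2(len(df))))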
So, making use of scikit-learn library we could code the experimentation with the k-means method using:
In [249]:
from sklearn.cluster import KMeans
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import cross_val_score, cross_val_predict, KFold
from sklearn import preprocessing, metrics
from sklearn.cluster import AgglomerativeClustering
from time import time
The k-means implementation in scikit-learn uses Lloyd's or Elkan's algorithm [http://www.vlfeat.org/api/kmeans-fundamentals.html]. It has a complexity of O(k n T), where k is the number of clusters, n is the number of samples and T is the number of iterations of the algorithm.
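As a rough order-of-magnitude check (illustrative numbers only): with $n \approx 7000$, $k = 12$ and the default limit of $T = 300$ iterations, a single run performs at most about $7000 \times 12 \times 300 \approx 2.5 \times 10^{7}$ point-to-centroid distance evaluations.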
For each k from 3 to 12, the algorithm is run with the 3 different seeds asked for by the assignment description (values 37, 110 and 777), giving 30 k-means runs in total. We use 4 parallel jobs (n_jobs) to execute the experiments; each run uses the default limit of 300 iterations and the k-means++ initialization.
In [250]:
seed_initialization = [37, 110, 777]
n_jobs = 4
sample_size_ = 300  # sample size used by silhouette_score
dataclust = dfmod[['Rating', 'Installs', 'Price', 'Reviews']]
table_results = []

# k-means: k from 3 to 12, each with the 3 seeds required by the assignment
for i in range(3, 13):
    for y in range(3):
        t0 = time()
        estimator = KMeans(init='k-means++', n_clusters=i, random_state=seed_initialization[y], n_jobs=n_jobs)
        estimator.fit(dataclust)
        print(time() - t0)
        print("k: " + str(i))
        print("seed: " + str(seed_initialization[y]))
        # homogeneity against the label_rating classes and silhouette on a sample of the data
        db = metrics.homogeneity_score(dfmod['label_rating'], estimator.labels_)
        si = metrics.silhouette_score(dataclust, estimator.labels_, metric='euclidean', sample_size=sample_size_)
        # use a list (not a set) so the column order is preserved in the results table
        result_estimator = ["kmeans", i, seed_initialization[y], db, si]
        table_results.append(result_estimator)

# agglomerative clustering run once, with a fixed number of clusters
estimator2 = AgglomerativeClustering(n_clusters=6, linkage='ward').fit(dataclust)
db2 = metrics.homogeneity_score(dfmod['label_rating'], estimator2.labels_)
si2 = metrics.silhouette_score(dataclust, estimator2.labels_, metric='euclidean', sample_size=sample_size_)
result_estimator = ["AgglomerativeClustering", 6, "-", db2, si2]
table_results.append(result_estimator)
All the metrics were collected in the same cell as the clustering runs above, so we can now show the results:
In [251]:
result = pd.DataFrame(table_results, columns=["Algorithm", "K", "Seed", "Homogeneity", "Silhouette"])
In [252]:
result
Out[252]:
The Friedman test is a version of the repeated-measures analysis of variance (ANOVA) that can be executed on ranked data.
So, to run a Friedman's ANOVA, we should:
1. Define the null and alternative hypotheses: $H_0$: there is no difference between the three conditions; $H_1$: there is a difference between the three conditions.
2. State alpha: $\alpha = 0.05$.
3. Calculate the degrees of freedom: $df = k - 1 = 3 - 1 = 2$.
4. State the decision rule: in this step we usually use the significance levels 0.050, 0.010 and 0.001.
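As a programmatic cross-check of the online calculator used below (a sketch, assuming the result table built above, with the seeds as conditions and the values of k as blocks), the same test can be run with scipy:
In [ ]:
from scipy.stats import friedmanchisquare
# pivot the k-means silhouette scores so each seed becomes one 'condition'
kmeans_res = result[result['Algorithm'] == 'kmeans']
pivot = kmeans_res.pivot(index='K', columns='Seed', values='Silhouette')
stat, p_value = friedmanchisquare(pivot[37], pivot[110], pivot[777])
print(stat, p_value)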
Using the website indicated by the professor, https://www.socscistatistics.com/tests/friedman/Default.aspx, we ran the test with the Silhouette metric of the different algorithms and seeds; the results are shown in the following images.
So, according to the test presented here, there is no significant difference between the two models (or parametrized models) that we used in this experiment, meaning that either of them should bring equivalent results. Tuning the models would probably yield better results, but that tuning is left for future work.